DATAX121-23A (HAM) & (SEC) - Introduction to Statistical Methods
We check the second and third assumptions only after fitting the regression model to the data
For simple linear regression: the R function lm() is, in effect, "saying" that the following equation, the best-fit line, is appropriate for the data
\[ \begin{aligned} y_i = &~\beta_0 + \beta_1 \times x_i + \varepsilon_{i}, \\ &~ \text{where} ~ \varepsilon_{i} \sim \text{Normal}(0, \sigma_\varepsilon) \end{aligned} \]
By fitting the model first, we obtain the residuals, \(\varepsilon_{i}\): after accounting for the best-fit line, we want them to have a similar spread across the fitted values and to be approximately Normally distributed
\[ \begin{aligned} y_i = &~\beta_0 + \beta_1 \times x_i + \varepsilon_{i}, \\ &~ \text{where} ~ \varepsilon_{i} \sim \text{Normal}(0, \sigma_\varepsilon) \end{aligned} \]
where \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and \(\varepsilon_{i}\) is the residual for the \(i\)th observation.
The best estimates of the \(\beta_j\)s are the \(\widehat{\beta}_j\)s obtained after fitting this equation to the data.
The above is achieved when the sum of squares for residuals, \(SSR\), is minimised, which, after a few more mathematical steps, gives us our best estimate of \(\sigma_\varepsilon\): \(\sqrt{MSR}\), where \(MSR = SSR/(n-2)\)
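As a quick sketch of the relationship above, \(\sqrt{MSR}\) can be recovered by hand from a fitted model and compared with what lm() reports; the (x, y) values below are made-up toy data, not the course data.

```r
# Sketch: recovering sqrt(MSR) by hand after fitting with lm().
# The (x, y) values are made-up toy data for illustration only.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
fit <- lm(y ~ x)

SSR <- sum(resid(fit)^2)          # sum of squares for residuals
n <- length(y)
sigma.hat <- sqrt(SSR / (n - 2))  # sqrt(MSR): best estimate of sigma_epsilon

# lm() reports the same value as the "Residual standard error"
all.equal(sigma.hat, summary(fit)$sigma)
```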
This diagnostic plot is a scatter plot known as the Residuals versus Fitted plot
If the simple linear regression model is appropriate for the data, the residuals should be randomly scattered about 0, with a similar spread across the fitted values
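A minimal sketch of producing this plot in R, assuming a fitted lm object (the toy data below are made up for illustration):

```r
# Sketch: drawing the Residuals versus Fitted plot for a fitted model
# (toy data made up for illustration)
set.seed(1)
x <- runif(50)
y <- 1 + 2 * x + rnorm(50)
fit <- lm(y ~ x)

plot(fit, which = 1)  # which = 1 selects the Residuals vs Fitted panel

# Equivalently, by hand: residuals (y-axis) against fitted values (x-axis)
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```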
As per Slides 3–4, it is the residuals that need to be approximately Normally distributed
We should only check this assumption if the second assumption has been satisfied
If the simple linear regression model is appropriate for the data, we can trust the inference it produces; so, for CS 10.1, we would trust the inference made with the simple linear regression model
This diagnostic plot is a scatter plot known as the Normal Q-Q plot
For any linear model that assumes the residuals are approximately Normally distributed, the observed residuals (y-axis) should agree with their theoretical Normal quantiles (x-axis)
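A minimal sketch of producing a Normal Q-Q plot in R, again on made-up toy data:

```r
# Sketch: drawing the Normal Q-Q plot of residuals for a fitted model
# (toy data made up for illustration)
set.seed(2)
x <- runif(50)
y <- 1 + 2 * x + rnorm(50)
fit <- lm(y ~ x)

plot(fit, which = 2)  # lm's built-in Normal Q-Q panel

# Equivalently, by hand: observed residuals against theoretical quantiles
qqnorm(resid(fit))
qqline(resid(fit))
```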
A pair of researchers were interested in whether the world record time for an Ironman has changed over time. To investigate this question, they recorded data each time an athlete broke the world record since 1989. The data contain the following variables for the eight athletes who broke the world record.
| Variables | |
|---|---|
| Time | A number denoting the world record time (in minutes) |
| Year | A number denoting the year in which the world record was broken |
Describe any features of the data
There is a negative linear relationship between the world record times for an Ironman and when the world record was broken. It is difficult to comment on the strength of the relationship with only eight observations.
What is the equation of the fitted model?
\(\text{Time}_{i} = \beta_0 + \beta_1 \times \text{Year}_i + \varepsilon_{i}\), where \(\varepsilon_{i} \sim \text{Normal}(0, \sigma_\varepsilon)\)
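This model can be fitted with lm(Time ~ Year, data = ironman.df), as in the output further down. The sketch below shows the workflow on made-up placeholder values, since the actual world-record data are not reproduced here:

```r
# Sketch of the fitting workflow. The Year/Time values below are
# placeholders made up for illustration -- NOT the actual Ironman data;
# the real call is lm(Time ~ Year, data = ironman.df).
records.df <- data.frame(
  Year = c(1989, 1994, 1999, 2004, 2009, 2014, 2019, 2021),  # made-up years
  Time = c(490, 484, 479, 473, 466, 459, 453, 449)           # made-up times (minutes)
)
records.fit <- lm(Time ~ Year, data = records.df)

coef(records.fit)     # beta0-hat and beta1-hat
confint(records.fit)  # 95% CIs for the coefficients
```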
No details were provided about how the data were collected. Also, is it believable that each observation is associated with a unique individual? Probably not in this case, as athletes who break world records tend to break them again within a short time.
Typically, for small datasets, justifying that this assumption has been met is very difficult. After some thought, it should be clear that the residuals are not randomly scattered evenly about 0.
We usually do not bother checking for approximately Normally distributed residuals if the previous assumption has not been met, as things can look Normal if we "collapse" the variable. The central part of the Normal Q-Q plot is a bit off, which suggests that the shape of the residuals' distribution may not be exactly Normal, so we should investigate further with a histogram of the residuals.
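A sketch of that follow-up histogram, on a toy fit with made-up data (with only eight residuals, as in this case study, a histogram is only a rough guide):

```r
# Sketch: following up the Q-Q plot with a histogram of the residuals
# (toy fit on made-up data; the actual ironman.fit is not reproduced here)
set.seed(3)
x <- 1:8
y <- 2600 - 1.1 * x + rnorm(8, sd = 4)
fit <- lm(y ~ x)

hist(resid(fit), main = "Histogram of residuals", xlab = "Residuals")
```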
\(\beta_1\)
With 95% confidence, we estimate a one-year increase decreases the average Ironman world record time by somewhere between 0.74 and 1.42 minutes.
\(\beta_0\)
With 95% confidence, we estimate that the average Ironman world record time at Year 0 (A.D.) is somewhere between 1955.83 and 3312.37 minutes.
Is the above interpretation sensible though?…
```r
Call:
lm(formula = Time ~ Year, data = ironman.df)

Residuals:
    Min      1Q  Median      3Q     Max
-6.6910 -2.0735  0.5486  1.9747  6.7607

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 2634.1026   277.1940   9.503 7.74e-05 ***
Year          -1.0815     0.1381  -7.834 0.000229 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.464 on 6 degrees of freedom
Multiple R-squared:  0.9109,    Adjusted R-squared:  0.8961
F-statistic: 61.36 on 1 and 6 DF,  p-value: 0.0002286
```
```r
confint(ironman.fit)
                  2.5 %     97.5 %
(Intercept) 1955.83343 3312.37181
Year          -1.41932   -0.74368
```
Multiple \(R^2\)
Our fitted model explains 91.09% of the variability in the Ironman world record times
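As a sketch of where that percentage comes from, the Multiple \(R^2\) can be recovered by hand as \(1 - SSR/SST\); the data below are made up for illustration.

```r
# Sketch: Multiple R-squared as the proportion of variability explained,
# 1 - SSR/SST (toy data made up for illustration)
x <- c(1, 2, 3, 4, 5)
y <- c(2.0, 4.1, 5.9, 8.2, 9.8)
fit <- lm(y ~ x)

SST <- sum((y - mean(y))^2)  # total sum of squares
SSR <- sum(resid(fit)^2)     # residual sum of squares
R2  <- 1 - SSR / SST

# Matches the "Multiple R-squared" reported by summary()
all.equal(R2, summary(fit)$r.squared)
```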
For when \(\text{Year}_i\) is 2021
We are 95% sure that the average Ironman world record time made in 2021 is somewhere between 442.5 and 454.3 minutes.
With 95% confidence, we predict that an Ironman world record time made in 2021 is somewhere between 436.0 and 460.8 minutes.
For when \(\text{Year}_i\) is 2023
We are extrapolating here, so any inference for the response assumes that the relationship between the response and explanatory variables remains linear
```r
Year2021 <- data.frame(Year = 2021)
Year2023 <- data.frame(Year = 2023)

## A 95% CI for a mean response and a 95% PI for an individual
## response, given that the year is 2021
predict(ironman.fit, newdata = Year2021, interval = "c")
      fit      lwr      upr
1 448.391 442.4791 454.3028
predict(ironman.fit, newdata = Year2021, interval = "p")
      fit      lwr      upr
1 448.391 435.9706 460.8113

## A 95% CI for a mean response and a 95% PI for an individual
## response, given that the year is 2023
predict(ironman.fit, newdata = Year2023, interval = "c")
      fit      lwr      upr
1 446.228 439.7894 452.6665
predict(ironman.fit, newdata = Year2023, interval = "p")
      fit      lwr      upr
1 446.228 433.5484 458.9075
```